gradient distribution
Understanding Gradient Clipping in Private SGD: A Geometric Perspective
Deep learning models are increasingly popular in many machine learning applications where the training data may contain sensitive information. To provide formal and rigorous privacy guarantees, many learning systems now incorporate differential privacy by training their models with (differentially) private SGD. A key step in each private SGD update is gradient clipping, which shrinks the gradient of an individual example whenever its ℓ2 norm exceeds a certain threshold. We first demonstrate how gradient clipping can prevent SGD from converging to a stationary point. We then provide a theoretical analysis of private SGD with gradient clipping. Our analysis fully characterizes the clipping bias on the gradient norm, which can be upper bounded by the Wasserstein distance between the gradient distribution and a geometrically symmetric distribution. Our empirical evaluation further suggests that the gradient distributions along the trajectory of private SGD indeed exhibit such a symmetric structure. Together, our results explain why private SGD with gradient clipping remains effective in practice despite its potential clipping bias. Finally, we develop a new perturbation-based technique that can provably correct the clipping bias even for instances with highly asymmetric gradient distributions.
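For concreteness, here is a minimal NumPy sketch of the per-example clipping step the abstract describes; the function names, toy gradients, and hyperparameters are illustrative assumptions rather than the paper's setup, and privacy accounting is omitted.

```python
import numpy as np

def clip(g, C):
    """Shrink g to have l2 norm at most C; shorter gradients pass through."""
    norm = np.linalg.norm(g)
    return g if norm <= C else g * (C / norm)

def private_sgd_step(w, per_example_grads, C, sigma, lr, rng):
    """One DP-SGD update: clip each example's gradient, sum, add Gaussian
    noise calibrated to the clipping threshold C, then average and step."""
    B = len(per_example_grads)
    clipped = [clip(g, C) for g in per_example_grads]
    noise = rng.normal(0.0, sigma * C, size=w.shape)
    g_hat = (np.sum(clipped, axis=0) + noise) / B
    return w - lr * g_hat

rng = np.random.default_rng(0)
w = np.zeros(3)
grads = [rng.normal(size=3) * 5.0 for _ in range(8)]  # hypothetical per-example gradients
w = private_sgd_step(w, grads, C=1.0, sigma=1.0, lr=0.1, rng=rng)
```

Note that clipping happens per example, before aggregation: this is what bounds each example's influence (sensitivity) on the update, and it is also what introduces the bias the paper analyzes.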
9ecff5455677b38d19f49ce658ef0608-AuthorFeedback.pdf
We thank the reviewers for their positive and constructive feedback. We address several points from the reviews below. The bias-reduction technique in Section 5 is designed for DP-SGD with clipping; when applied to DP-SGD, the update rule is shown below. Typos: thank you for pointing them out; we will correct them.
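The update rule referenced above does not survive in this extract. As a hedged reconstruction: combining standard DP-SGD with clipping and the paper's perturbation-before-clipping idea would give something of the following form, where the symbols ($z_i$, $b$, $\eta_t$) are our own labels rather than the authors' notation:

```latex
% Hypothetical reconstruction: perturb each per-example gradient g_i with
% Gaussian noise z_i before clipping, then add privacy noise to the clipped
% sum as in standard DP-SGD.
\[
w_{t+1} = w_t - \frac{\eta_t}{B}\left(
    \sum_{i=1}^{B} \mathrm{clip}_C\!\big(g_i(w_t) + z_i\big)
    + \mathcal{N}\!\big(0,\,\sigma^2 C^2 I\big)
\right),
\qquad z_i \sim \mathcal{N}(0,\, b^2 I),
\]
\[
\text{where } \mathrm{clip}_C(v) = v \cdot \min\{1,\, C / \lVert v \rVert_2\}.
\]
```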
On Design Principles for Private Adaptive Optimizers
Ganesh, Arun, McMahan, Brendan, Thakurta, Abhradeep
The spherical noise added to gradients in differentially private (DP) training undermines the performance of adaptive optimizers like AdaGrad and Adam, and hence many recent works have proposed algorithms to address this challenge. However, the empirical results in these works focus on simple tasks and models, and the conclusions may not generalize to model training in practice. In this paper we survey several of these variants, develop better theoretical intuition for them, and perform empirical studies comparing them. We find that the common intuition of aiming for unbiased estimates of the second moments of gradients in adaptive optimizers is misguided; instead, a simple technique called scale-then-privatize (which does not achieve unbiased second moments) has more desirable theoretical behavior and outperforms all other variants we study on a small-scale language model training task. We additionally argue that scale-then-privatize makes the added noise a better match for correlated-noise mechanisms, which are more desirable to use in practice.
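To make the ordering concrete, below is a hedged sketch of one reading of scale-then-privatize: per-example gradients are preconditioned by a second-moment estimate before clipping and noising, rather than privatized first and rescaled afterwards. The estimator, names, and hyperparameters here are our assumptions; the paper's exact construction (in particular, how the moment statistics are obtained privately) may differ.

```python
import numpy as np

def scale_then_privatize_step(w, per_example_grads, v, C, sigma, lr, rng,
                              beta2=0.999, eps=1e-8):
    """Scale each per-example gradient by 1/sqrt(v) BEFORE clipping and
    noising, so the privacy noise is added in the preconditioned space."""
    scale = 1.0 / np.sqrt(v + eps)
    scaled = [g * scale for g in per_example_grads]
    clipped = [g * min(1.0, C / (np.linalg.norm(g) + 1e-12)) for g in scaled]
    noisy = (np.sum(clipped, axis=0)
             + rng.normal(0.0, sigma * C, size=w.shape)) / len(per_example_grads)
    # CAUTION: updating v from raw gradients would leak privacy; a real
    # system must use stale, public, or separately privatized statistics.
    mean_g = np.mean(per_example_grads, axis=0)
    v = beta2 * v + (1.0 - beta2) * mean_g**2
    return w - lr * noisy, v

rng = np.random.default_rng(0)
w, v = np.zeros(4), np.ones(4)
grads = [rng.normal(size=4) for _ in range(8)]
w, v = scale_then_privatize_step(w, grads, v, C=1.0, sigma=1.0, lr=0.1, rng=rng)
```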
Multi-Modal Learning with Bayesian-Oriented Gradient Calibration
Guo, Peizheng, Wang, Jingyao, Guo, Huijie, Li, Jiangmeng, Sun, Chuxiong, Zheng, Changwen, Qiang, Wenwen
Multi-Modal Learning (MML) integrates information from diverse modalities to improve predictive accuracy. However, existing methods mainly aggregate gradients with fixed weights and treat all dimensions equally, overlooking the intrinsic gradient uncertainty of each modality. This may lead to (i) excessive updates in sensitive dimensions, degrading performance, and (ii) insufficient updates in less sensitive dimensions, hindering learning. To address this issue, we propose BOGC-MML, a Bayesian-Oriented Gradient Calibration method for MML that explicitly models gradient uncertainty and guides the optimization toward the optimal direction. Specifically, we first model each modality's gradient as a random variable and derive its probability distribution, capturing the full uncertainty in the gradient space. Then, we propose an effective method that converts the precision (inverse variance) of each gradient distribution into a scalar evidence value, which quantifies the confidence of each modality in every gradient dimension. Using these evidence values, we explicitly quantify per-dimension uncertainties and fuse them via a reduced Dempster-Shafer rule. The resulting uncertainty-weighted aggregation produces a calibrated update direction that balances sensitivity and conservatism across dimensions. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness and advantages of the proposed method.
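As a rough illustration of the precision-as-evidence idea, the sketch below estimates per-dimension gradient variance for each modality from a few stochastic draws and fuses the modality gradients with normalized precision weights. This plain normalization stands in for the reduced Dempster-Shafer combination actually used by BOGC-MML; all names and the sampling setup are our assumptions.

```python
import numpy as np

def fuse_modal_gradients(grad_samples_per_modality, eps=1e-8):
    """grad_samples_per_modality: list with one (num_samples, dim) array of
    stochastic gradient draws per modality. Returns a fused (dim,) update."""
    means, evidences = [], []
    for samples in grad_samples_per_modality:
        means.append(samples.mean(axis=0))
        evidences.append(1.0 / (samples.var(axis=0) + eps))  # precision as evidence
    E = np.stack(evidences)                      # (num_modalities, dim)
    weights = E / E.sum(axis=0, keepdims=True)   # per-dimension confidence weights
    return (weights * np.stack(means)).sum(axis=0)

rng = np.random.default_rng(1)
audio = rng.normal(1.0, 0.1, size=(16, 4))   # low-variance (confident) modality
video = rng.normal(-1.0, 2.0, size=(16, 4))  # high-variance (uncertain) modality
g = fuse_modal_gradients([audio, video])     # fused direction leans toward audio
```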
SWAN: SGD with Normalization and Whitening Enables Stateless LLM Training
Ma, Chao, Gong, Wenbo, Scetbon, Meyer, Meeds, Edward
Adaptive optimizers such as Adam (Kingma & Ba, 2015) have been central to the success of large language models. However, they often require maintaining optimizer state throughout training, which can result in memory requirements several times greater than the model footprint. This overhead imposes constraints on scalability and computational efficiency. Stochastic Gradient Descent (SGD), in contrast, is a stateless optimizer: it does not track state variables during training and consequently achieves optimal memory efficiency. However, its capability in LLM training is limited (Zhao et al., 2024b). In this work, we show that pre-processing SGD's gradients in a stateless manner can achieve the same performance as the Adam optimizer for LLM training, while drastically reducing the memory cost. Specifically, we propose to pre-process the instantaneous stochastic gradients using normalization and whitening. We show that normalization stabilizes gradient distributions, and whitening counteracts the local curvature of the loss landscape. This results in SWAN (SGD with Whitening And Normalization), a stochastic optimizer that eliminates the need to store any optimizer states. Empirically, SWAN has the same memory footprint as SGD, achieving $\approx 50\%$ reduction in total end-to-end memory compared to Adam. In language modeling tasks, SWAN demonstrates comparable or even better performance than Adam: when pre-training LLaMA models with 350M and 1.3B parameters, SWAN achieves a 2x speedup by reaching the same evaluation perplexity using half as many tokens.
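Below is a hedged sketch of the two pre-processing steps the abstract names, applied to a single weight matrix's gradient. SWAN's exact operators, and in particular its efficient computation of the inverse square root, may differ; this version uses a direct eigendecomposition for clarity, and all names are ours.

```python
import numpy as np

def normalize(G, eps=1e-8):
    """Row-wise standardization, intended to stabilize the gradient distribution."""
    return (G - G.mean(axis=1, keepdims=True)) / (G.std(axis=1, keepdims=True) + eps)

def whiten(G, eps=1e-8):
    """Left-multiply by (G G^T)^(-1/2), counteracting local curvature."""
    S = G @ G.T + eps * np.eye(G.shape[0])
    vals, vecs = np.linalg.eigh(S)               # S is symmetric positive definite
    inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return inv_sqrt @ G

def swan_like_step(W, G, lr):
    """Stateless update: both operators use only the instantaneous gradient G,
    so nothing persists between steps (no moments, no momentum buffers)."""
    return W - lr * whiten(normalize(G))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
G = rng.normal(size=(8, 16))                     # stand-in stochastic gradient
W = swan_like_step(W, G, lr=1e-2)
```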
Data-Driven Gradient Optimization for Field Emission Management in a Superconducting Radio-Frequency Linac
Goldenberg, Steven, Ahammed, Kawser, Carpenter, Adam, Li, Jiang, Suleiman, Riad, Tennant, Chris
Jefferson Lab's Continuous Electron Beam Accelerator Facility (CEBAF) [1] relies on two superconducting radio-frequency linear accelerators (SRF linacs) to deliver high-energy electron beams to nuclear physics experiments in the four experimental halls [2]. Cryomodules, which contain multiple SRF cavities, are an integral part of these linacs. These SRF cavities provide the main accelerating gradients to the electron beam, and currently produce the 12 GeV beam necessary for scientific discovery. However, since the energy upgrade, CEBAF has suffered from significant field emission (FE) induced radiation. With RF on, dose rates observed at 30 cm from the beamline are as high as 10 rem/h and 100 rem/h for neutron and gamma radiation, respectively. This level of radiation causes significant damage to beamline components, including vacuum valves, magnets, and cables of beam position monitors and ion pumps. Replacing these components can use significant resources. Worse, portions of both linacs are considered "Radiation Areas" for days or even weeks into scheduled downtime, limiting maintenance activities to…